23 research outputs found
Representativeness as a Forgotten Lesson for Multilingual and Code-switched Data Collection and Preparation
Multilingualism is widespread around the world and code-switching (CSW) is a
common practice among different language pairs/tuples across locations and
regions. However, there is still not much progress in building successful CSW
systems, despite the recent advances in Massive Multilingual Language Models
(MMLMs). We investigate the reasons behind this setback through a critical
study about the existing CSW data sets (68) across language pairs in terms of
the collection and preparation (e.g. transcription and annotation) stages. This
in-depth analysis reveals that \textbf{a)} most CSW data involves English
ignoring other language pairs/tuples \textbf{b)} there are flaws in terms of
representativeness in data collection and preparation stages due to ignoring
the location based, socio-demographic and register variation in CSW. In
addition, lack of clarity on the data selection and filtering stages shadow the
representativeness of CSW data sets. We conclude by providing a short
check-list to improve the representativeness for forthcoming studies involving
CSW data collection and preparation.Comment: Accepted for EMNLP'23 Findings (to appear on EMNLP'23 Proceedings
Multilingual CheckList: Generation and Evaluation
The recently proposed CheckList (Riberio et al,. 2020) approach to evaluation
of NLP systems has revealed high failure rates for basic capabilities for
multiple state-of-the-art and commercial models. However, the CheckList
creation process is manual which creates a bottleneck towards creation of
multilingual CheckLists catering 100s of languages. In this work, we explore
multiple approaches to generate and evaluate the quality of Multilingual
CheckList. We device an algorithm -- Automated Multilingual Checklist
Generation (AMCG) for automatically transferring a CheckList from a source to a
target language that relies on a reasonable machine translation system. We then
compare the CheckList generated by AMCG with CheckLists generated with
different levels of human intervention. Through in-depth crosslingual
experiments between English and Hindi, and broad multilingual experiments
spanning 11 languages, we show that the automatic approach can provide accurate
estimates of failure rates of a model across capabilities, as would a
human-verified CheckList, and better than CheckLists generated by humans from
scratch
Breaking Language Barriers with a LEAP: Learning Strategies for Polyglot LLMs
Large language models (LLMs) are at the forefront of transforming numerous
domains globally. However, their inclusivity and effectiveness remain limited
for non-Latin scripts and low-resource languages. This paper tackles the
imperative challenge of enhancing the multilingual performance of LLMs,
specifically focusing on Generative models. Through systematic investigation
and evaluation of diverse languages using popular question-answering (QA)
datasets, we present novel techniques that unlock the true potential of LLMs in
a polyglot landscape. Our approach encompasses three key strategies that yield
remarkable improvements in multilingual proficiency. First, by meticulously
optimizing prompts tailored for polyglot LLMs, we unlock their latent
capabilities, resulting in substantial performance boosts across languages.
Second, we introduce a new hybrid approach that synergizes GPT generation with
multilingual embeddings and achieves significant multilingual performance
improvement on critical tasks like QA and retrieval. Finally, to further propel
the performance of polyglot LLMs, we introduce a novel learning algorithm that
dynamically selects the optimal prompt strategy, LLM model, and embeddings per
query. This dynamic adaptation maximizes the efficacy of LLMs across languages,
outperforming best static and random strategies. Our results show substantial
advancements in multilingual understanding and generation across a diverse
range of languages